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Abstract 

In modern molecular biology one of the most common ways of studying a vertebrate immune system 
is to statistically compare the counts of sequenced antigen receptor clones (either immunoglobulins or 
T-cell receptors) derived from various tissues under different experimental or clinical conditions. The 
problem is difficult and does not fit readily into the standard statistical framework of contingency tables 
primarily due to serious under-sampling of the receptor populations. This under-sampling is caused on 
one hand by the extreme diversity of antigen receptor repertoires maintained by the immune system 
and, on the other, by the high cost and labor intensity of the receptor data collection process. In most of 
the recent immunological literature the differences across antigen receptor populations are examined via 
non-parametric statistical measures of species overlap and diversity borrowed from ecological studies. 
While this approach is robust in a wide range of situations, it seems to provide little insight into the 
underlying clonal size distribution and the overall mechanism differentiating the receptor populations. 
As a possible alternative, the current paper presents a parametric method which adjusts for the data 
under-sampling as well as provides a unifying approach to simultaneous comparison of multiple receptor 
groups by means of the modern statistical tools of unsupervised learning. The parametric model is based 
on a flexible multivariate Poisson-lognormal distribution and is seen to be a natural generalization of the 
univariate Poisson-lognormal models used in ecological studies of biodiversity patterns. The procedure 
for evaluating model's fit is described along with the public domain software developed to perform the 
necessary diagnostics. The model-driven analysis is seen to compare favorably vis a vis traditional 
methods when applied to the data from T-cell receptors in transgenic mice populations. 
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1 Introduction 



The major feature of the adaptive immune system is its capacity to generate clones of B and T-cells that 
are able to recognize and neutralize specific antigens. Both cell types recognize antigens by a special 
class of surface molecules called B- and T-cell receptors. The methodology developed in this paper will 
apply to both types of receptors, for the sake of clarity and simplicity we describe the background and 
the overall problem in terms of T-cell receptors. For a general introduction to the molecular biology of 
the immune system, we refer interested reader to e.g., Janeway (20051. 

A single T-cell receptor (TCR) is composed of two chains, a and f3, that are formed during T- 
cell differentiation. Both chains are formed by rearrangements of genetic segments, Va and 3a for 
TCRa chain and V/3, D/3 and J/3 for TCR/3 chain. Since there are a number of segments of each 
type in the genomic DNA , a great number of different a and f3 chains are generated. This chain 
diversity is further increased by the recombination process when individual nucleotides might be added 
or deleted at the junctional sites. The region containing these highly variable junctions is the third of 
three complementarity-determining regions (CDRs) that are seen crystallographically to contact antigen. 
The sources of TCR diversity are thus naturally broken down hierarchically into gene segment family 
(library), segment within family, CDR3 length and CDR3 nucleotide diversity. Both combinatorial 
and insertional re-arrangements result in the huge TCR repertoire ensuring that immune system has a 
potential to recognize a large number of antigens. For instance, it is estimated that in mice the number 



of different TCRs that can be formed exceeds 10 15 (Davis and Bjorkman 



1988 



Casrouge et al. 2000). 



For humans, it is estimated that over 10 18 different TCRs can be produced and the number of different 



TCR species {TCR richness) in a human at any given time has been estimated to exceed 10 7 (Arstila 
et al. 1999| Naylor et al.| [2005 ). This clonal diversity of TCR populations makes them particularly 



challenging objects to analyze statistically. 

In what follows, we are concerned with the statistical analysis of the diversity of TCR repertoire 
samples obtained from various subsets of T-cells as counts of different TCR clones. These T-cell subsets 
are generally defined based on the expression of cell surface markers leading to different functions in the 
immune response. For example, naive T-cells are defined as cells that did not encounter an antigen in 
their lifetime, while memory T-cells are cells that previously responded to an antigenic stimulation and 
underwent clonal expansion. The frequency of individual T-cell clones in normal individuals is very low. 
However, once a naive T-cell expressing appropriate TCR encounters antigen, it becomes activated and 
expands forming multiple clone cells. Thus, the analysis of TCR repertoire clonal size distribution has 
become a crucial element in many studies aimed at better understanding the evolution of the immune 
responses in vaccinated individuals or patients suffering from autoimmune diseases or cancer (cf., e.g., 
Butz and Bevan|fl998 ) . The clonal size estimates has been recently applied to quantify differences in 



TCR repertoires of normal and infected individuals and to help determine which T-cell clones persisted 
as memory T-cells ( McHeyzer- Williams et al.||1999 Sebzda et al.||1999 Busch and Pamer||1999 Savage 



et~al~| |l999). The comparisons of the diversity of TCR repertoires has been also used to determine the 



origin of T-cells and to study heterogeneity of memory T-cells (Sallusto et al. 1999. 2004). For the 
purpose of the current paper, under the term "clonal diversity" we will understand both the TCRs 
clonal size distribution (abundance pattern) as well as their richness (number of different TCRs). 

There are several statistical approaches to assess T-cell diversity based on the TCR clones counts. 
For instance, in the non-parametric approach, Simpson's diversity and Shannon's entropy indices have 



been applied to measure the diversity of collected samples ( Ferreira et al. 2009 Venturi et al. 2008 1 



However, such indexes are typically just summary statistics and, due to frequent data under-sampling, 
provide only limited information about the true diversities of the populations. Another approach seen 
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in immunological studies relies on modeling diversity parametrically assuming that all clonotypes are 
equally represented in the repertoire (Barth et al. 1985 Behlke et al. 1985). The advantage of this 



homogenous model is its computational and conceptual simplicity which contributes to its wide use 



Casrouge et al. 


2000 


Hsieh et al. 


2004 


2006 


Pacholczyk et al. 


2007 



2006). However, this simple model is called into question by the empirical evidence (see e.g., Naumov 



et al. 2003 Pewe et al. 2004) suggesting heavy right tails of the clonal size distributions. To account 



for this heterogeneity, the homogenous model has been expanded with a variety of mixture models, 
typically under the assumption of the Poisson-distribution of the TCR clones. These so-called Poisson 
abundance mixture models ( Chao 2006 ) assume that each TCR variant (i.e., each clone family) is sampled 
according to the Poisson distribution with a specific sampling rate, itself varying according to a prescribed 



parametric (mixing) distribution e.g., exponential, gamma, or lognormal (Ord and Whitmore 1986 



Sepulveda et al. 2009; Buhner 1974). A recent detailed comparative study of Sepiilveda et al. (2009) 



identified one of such models, the Poisson- lognormal mixture (PLN), as particularly well suited for 
modeling clonal diversity. The special appeal of the PLN is in that it may be naturally extended to 
multivariate setting, allowing therefore for the simultaneous analysis of abundance patterns of several 
repertoires (Engen et al. 2002 1. In particular, the bivariate extension of the PLN model may be used to 



derive a class of pairwise dissimilarity measures between repertoires and to construct tree-based hierarchy 
relating various TCR repertoires. 

The purpose of this article is to describe a new general approach to analyzing and comparing TCR- 
type data arriving from multiple repertoires. The approach we develop here relies on the TCR reper- 
toires dissimilarities analysis where the appropriate tree-distance measures are derived under a simple, 
easily testable and interpretable model of clonal abundance based on the parametric bivariate Poisson- 
lognormal distribution (BPLN). By means of examples derived from real TCR data, we argue that 
under BPLN both the moment-based and the information-based parametric measures of dissimilarity 
yield consistent and biologically meaningful results. The paper is organized as follows. In the next 
section (Section [2]) we give a brief overview of the Poisson abundance models both in univariate and 
multivariate (bivariate) settings. In Section 3 we discuss one popular method for deriving dissimilarity 
measures which is particularly relevant for TCR data studies and provide the formal definitions of the 
four different measures considered in the paper. In Section 4 we present the application of our method 
to TCR data obtained from the populations of naive and so-called regulatory T-cell receptors in healthy 
and immune-deficient mice. This data set was described in detail and analyzed by different methods in 



Pacholczyk et al. ( 2006 ). We re-analyze it using clustering algorithms derived under both non-parametric 
and BPLN models and compare the results. In Section 5 we provide a concise summary of our findings 
and offer some concluding remarks. Some elementary derivations related to the entropy function are 
provided for readers convenience in the appendix. 



2 Poisson Models of Abundance 



Poisson abundance models arrive naturally in the biodiversity studies if we assume (see, e.g., Chao|| 2006 ) 
that the species sampling is done by a "continuous type of effort" i.e., data is recorded as arriving from 
a mixture of Poisson processes in time interval from to T (in what follows we take T = 1). This type 
of model approach can be traced back to Fisher, Corbet and Williams ( Fisher et al. 1943 ) . Consider M 
species labeled from 1 to M. Individuals of the i-th species arrive in the sample according to a Poisson 
process with a discovery rate A^. If the detectability of individuals can be assumed to be equal across 
all species (which is typically a case in TCR repertoires analysis), then the rates can be interpreted as 
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species abundances (Nayak 1991 ). In this sampling scheme, the sample size n (the number of individuals 
observed in the experiment) is a random variable. Since the conditional frequencies follow a multinomial 
distribution with class total n and class probabilities given by relative frequencies Xj/Y], , Xk, many 
estimators are shared in both the continuous-type Poisson models and the discrete-type (multinomial) 
models where n is assumed to be a constant. We note here that in case of antigen receptors data, the 
constant n is sometimes known (e.g., DNA sequencing data) and sometimes not known (e.g., spectratype 
data, see, I Kepler et al.||2005| . In the latter case the prior distributional form of n is typically assumed 



and the posterior model is investigated based on the Bayesian tools (see, e.g., Rodrigues et al. 2001 
Lewins and Joanes 1984 Barger and Bunge 2008 Solow 1994 ). Since the present paper is motivated 
by the single-cell DNA sequencing data, we are assuming throughout that n is known. The extension 
of our model to unknown n along the lines of Rodrigues et al. ( |2001 ) is reasonably straightforward but 
not pursued here. 



2.1 Univariate mixture models 

Since it has been generally accepted that the antigen receptor clonal size distributions have heavy right 
tails, to adjust for the over-dispersion the species rates (Ai, A2, • ■ • , Xm) are typically modeled as a 
random sample from a mixing distribution with density f(X;9), where 9 is a low-dimensional vector 
of parameters. Following the famous paper by Fisher and his colleagues (Fisher et al. 19431 many 



researchers have adopted a gamma density as a mixing model. Other parametric models include among 
others the log- normal (Bulmer 1974), inverse-Gaussian (Ord and Whitmore 1986[), and generalized 



inverse-Gaussian ( Sichel 1997 ) distributions as well as many others ( Sepiilveda et al. 2009 ) . An obvious 



advantage of such parametric models is that the inference problem reduces to estimating only a few 
relatively low-dimensional parameters for which the traditional estimation procedures can be typically 
applied. For any mixture density f(X;8), define pg(k),k = 0,1,... as the probability that any TCR 
species is observed k times in the sample, that is 



Pe(k) 



[X k e- x /k\]f(X:9)dX k = 0, 1, 



(2.1) 



Denoting by f k (k = 1,2, 
we have E(f k ) = Mp e {k). 



. . , n) the number of receptor species observed exactly /c-times in the sample 
Setting D = ^ fe fk the likelihood function for M and can be written as 



L(M,e\{f k }) 



Ml 



(M - D)\U k>1 (fkl) 



MO) 



M-D 



k>\ 



The (unconditional) MLEs for M and and their asymptotic variances are obtained based on the above 
likelihood which as we can see depends on the data only through the observed values of {fk}- The 
likelihood can be factored as 

L(M, e\{f k }) = L b (M, d\D)L c (0\{f k }, D) 

where £&(M, 6\D) is a likelihood with respect to D, a binomial (M, 1— pe(0)) variable, and L c (6\{fk}, D) 
is a (conditional) multinomial likelihood with respect to {fk] k > 1} with cell total D and zero-truncated 
cell probabilities pg(k)/[l — pe(0)], k > 1, i.e., 



L c (8\{f k },D) = 



D\ 

IW/*0 



n 

k>l 



Pe(k) 



(2.2) 
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The MLE obtained from this likelihood can be regarded as a (conditional) empirical Bayes estimator 
if we think of the mixing distribution as a prior distribution having unknown parameters that must be 
estimated. See, e.g., Rodrigues et al. (2001) for further reference. 



2.2 Extension to bivariate models 

As we shall see in the next section, it is of interest to also consider multivariate models of abundance. 
For the purpose of our discussions below we focus on the bivariate models but the modifications for 
higher dimensions are rather straightforward. For simplicity assume that we have the same M species in 
both populations. In direct analogy with the notation of the previous section, define now pe(k, I) to be 
the probability that any TCR species (i.e., a TCR clone) is present k times in the sample from the first 
population (repertoire) and I times in the sample from the second one. Let fk t i be the empirical count 
and set now D = J2k i>o f*>,l (assuming / ,o = 0). Let f(\\, A 2 , 9) be the bivariate mixture distribution. 
The likelihood formulae from previous section extends to the bivariate case as 



Ml 

k,l>0 

where 

/■OO 

Pe (k,l)= [A 1 e- Al A^e- A2 /(fc!n)]/(Ai,A 2 ;6»)dA 1 dA 2 fc,/ = 0, 1, (2.3) 

Jo 

Note that, as before, E(fk,i) — Mp$(k, I). The likelihood function for M and 6 can be again factored as 

L(M, 6\{f k j}) = L b (M, 0\D)L c (6\{f ktl }, D) 



where, in obvious analogy with \2.2\ 1 Lb(M,0\D) is now a likelihood with respect to D, a binomial 
variable with parameters (M, 1 — p#(0,0)) and L c (9\{fk } i}, D) is a (conditional) multinomial likelihood 
with respect to {fk,i,k + I > 0} with cell total D and the bivariate, zero-truncated cell probabilities 
{p e (k, - Pe(0, 6)]}fc,i, k + I > 0, i.e., 



L c (e\{f k!l },D)= Dl TJ 

11&J>0 Jk,i- 



k>l 



Pe(k,l) 
1-Pe(0,0) 



(2.4) 



3 Diversity Analysis and Clustering 



When studying evolution of TCR species it is of interest to compare their diversity, by which we mean 
herein (cf. Section 1) the clonal size distribution {pg(k)} and the species number M. Such repertoire 
diversity comparisons are of great interest for instance in clinical studies, where the quantities of interest 
are the "divergences" of multiple observed TCR repertoires from some control or asymptotic one. The 
individual repertoires of antigen receptors can be then characterized in terms of their divergence from 
the control ( |Chen et al.] |2003| |Komatsu et al.[|2009||Pacholczyk et al.||2007| 12006) . Under our definition 
the TCR repertoire diversity is completely determined by the parameters (M, 8) . This agrees with 
the original concept of "species diversity" known from the field of ecology where the term itself relates 
both to the number of species (richness) and to their apportionment within the sequence (evenness 
or equitability, see Sheldon 1969 ) . A sensible method of comparing diversity of multiple repertoires 



simultaneously is based on a concept of (pairwise) diversity dissimilarity measure and the hierarchical 
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clustering induced by it. The hierarchical clustering which we discuss in more detail below, is one of 
many modern methods of analyzing patterns in high-dimensional data on the grounds of the so-called 
unsupervised statistical learning theory (cf., e.g., Hastie et al.|200l" |, a very dynamically developing area 
of modern statistics. 



3.1 Diversity dissimilarity measures 



Assume that the overall "similarity" between a pair of TCR repertoires with respective clonal abundance 
distributions p and q is quantified by some non- negative function Q(p, q) referred to as the similarity index 
or similarity measure. Since typically the samples from the joined distribution (frequency) of abundance 
are not available in data collected from TCR repertoires, some of the crude similarity indices are based 
simply on the joined TCR species presence/absence data, i.e., the number of TCR species shared by two 



samples and the number of species unique to each of them (see discussion in Legendre and Legendre 
1998). Examples of such indices are the classical Jaccard index and the closely related S0rensen index 



the two oldest and most widely used similarity indices in ecological biodiversity studies ( Magurran 2005 ) 



One of the advantages of the Poisson mixture model described in the previous section, is that it allows 
for defining meaningful indices incorporating pairwise comparisons of the TCR species based on joined 
distribution of abundance. As representative examples of such indices we consider here the Morisita- 
Horn index (T>mh) ( |Magurran{ 2005), the mutual information criterion (T>mi) which is a special case of 
a Kullback-Leibler divergence (see, e.g., Koski 2001 ), and the overlap index (T>ov) introduced by Smith 
et al. ( 1996 ). All of these indices give rise to the corresponding measures of dissimilarity concentrated on 



the unit interval with the perfect correlation (or complete overlap) between the frequency distributions 
yielding the value zero. Indeed, they are all seen as special cases of the following general construction. 
For any bivariate probability distribution pg with corresponding marginal distributions p s 
consider a similarity index Q{p g 1 \p g 2 ^) satisfying 



and 



,( 2 ) 



.(2) 



with the right bound attained when 

measure of dissimilarity between the pair (p*g~\p^) may be defined as 



(3.1) 



Then the corresponding (normalized) Q-induced 



To obtain the Morisita-Horn dissimilarity index (T>mh) we take in ( |3.2[ ) Q = Qmh 

QMH(p e 1) ,P { e ) )= E klp„(k,l). 



(3.2) 



(3.3) 



fc,/>i 



which obviously satisfies (3.1 ) for any non-negative random variables. A closely related correlation-based 
dissimilarity index T> p is obtained when we take Q — Q p with 



k,l>0 



klpe(k, 



(3.4) 



where k — (fc — mx)/si and I — (l — m?)/s2 and rrij and Si are, respectively, mean and standard deviation 



of 



Po 



W A _ 



1,2. In this case the inequality (3.1) simply asserts that < Q p < 1. 
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A popular dissimilarity measure which we shall denote here by T>ov is obtained by averaging con- 
ditional probabilities of individual receptor species presence in both samples, given its presence in one. 
The measure was introduced by ISmith et al. ( 1996 ) to quantify 'overlap' between repertoires and is 



obtained by taking the following similarity index. 



2 



X 



fe>0 



1 



.,(2) 



Z>0 J 



(0, 



(3.5) 



which again trivially satisfies (3.1) 



Finally, in order to obtain the mutual information dissimilarity index (T>mi) we take in (3.2) Q 
Qui where 

pe(k,l) 



QMi{p ( e\pt ] )= £ MM) log 



{k) P f\l) 



(3.6) 



The fact that the above function satisfies ( |3.1[ ) is shown in the appendix. Note that all the above 
similarity indices Qmh, Qp, Qov and Qmi (and thus also their corresponding dissimilarity measures) 
depend on the underlying mixing distribution parameters 9 but not explicitly on the species number M. 
This is desirable since the quantity M is typically unknown and difficult to estimate due to often very 
severe undersampling of the TCR repertoires (cf., e.g., Sepulveda et al.|2009 ). In the parametric setting 
considered here, if needed, the value of M may be estimated (a-posteriori) by either of the estimates 



Mi =£>/(l-p e (0,0)) 
M 2 = h,l/Pe(k,l). 

fe,/>0,fc+/>0 



(3.7) 
(3.8) 



whose close numerical agreement usually indicates a robust fit of the bivariate parame tric mixture 
model. Note that M2 is simply a parametric version of the Horvitz-Thompson estimator (Horvitz and 



Thompson 1952). 



3.2 Hierarchical clustering 

For a given pairwise dissimilarity measure T> of TCR repertoires, it is a standard unsupervised statis- 
tical learning approach to simultaneously compare N repertoires in terms of T> by means of building 
hierarchical clusters which are graphically represented by dendrogarms or "tree diagrams" . In such hi- 
erarchical clustering procedure the TCR data are not partitioned into a particular cluster in a single 
step, instead, as a name suggests, a hierarchical structure is produced in which the clusters at each 
level of the hierarchy are created by merging clusters at the next lower level. The main advantage of 
hierarchical clustering approach lies in the fact that no cluster number needs to be specified in advance. 
Hierarchical clustering is performed via either agglomerative methods, which proceed by series of fusions 
of the N objects into groups, or divisive methods, which separate objects successively into finer group- 
ings. Agglomerative techniques are more commonly used, and this is the method we consider below for 
TCR repertoires. For a general introduction to clustering and unsupervised learning, interested reader 



is referred to chapter 14 in the popular monograph |Hastie et al. (2001). 



3.3 Poisson-lognormal model 



Whereas there are many possible models of parametric bivariate mixture, the recent studies in |Engen| 
et al. ( 2002 1 and Sepulveda et al. ( 2009 ) seem to indicate that lognormal mixing distributions may be 
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often an appropriate choice for TCR repertoires modeling. In that spirit we consider herein a bivariate 
model based on log-binormal variates. Under the assumption of random sampling, the number of 
individuals sampled from a given receptor species with abundance A is Poisson distributed with mean A. 
If we assume that In A is normally distributed with mean p and variance a 2 among species, then the vector 
of individuals sampled from all M species constitutes a sample from the Poisson lognormal distribution 
with parameters 6 = (p,a 2 ), where p and a 2 are the mean and variance of the log abundances. The 



corresponding mass function is of the general form (2.1 ) and may be written as 



p(k;p,a 2 ) 



g k {p,a,u)(j)(u)du 



(3.9) 



where <j){-) is a standard normal density function and 



gk(p,<r, u) 



exp 



[uak + pk + e -(« CT +M)] 



fc! 



k > 



is the re-parametrized Poisson distribution. Similarly, when we consider pairs of counts of individual 
receptors from two different repertoires we may think of them as a random sample (of size M) from the 
bivariate Poisson-lognormal distribution (BPLN) with probability mass function given as in ( ]2.3[ ). That 
is, we assume that the log abundances among species have the binormal distribution with parameters 



(Mi) A*2j 01) 02) p)- F° r a general detailed description of the multivariate Poisson distributions, see Aitchi- 
|son and Ho (1989). Let <f)(u, v; p) denote the normal bivariate density with correlation coefficient p, zero 



means and unite variances. The distribution of BPLN is given in terms of the bivariate probability mass 
function p$(k, I) = p(k, /; pi, /i 2 , a 2 , a\, p) for fc, I > where 



p(k,l;p 1 ,p 2 ,al,al 1 p) 



9k (Pi i 01 1 u )di (a*2 , 02 , v)(p(u, v; p)du dv. 



(3.10) 



— oo J — oo 



From the above formula it follows in particular that both marginals of BPLN are the univariate Poisson- 
lognormal distributions (3.9 ) with respective parameters (/Ltj, of) (i — 1, 2). Since M is usually unknown, 



when fitting the model we only consider the number of individuals for the observed receptor species and 
thus the distribution of the number of observed individual receptors follows the zero-truncated BPLN 
distribution with probability mass function 



p(fc,/;/ii,/i 2 ,o- 2 ,cr|,p) 
1 -p(0, 0;pi,p 2 ,v 2 ,v 2 ,p)' 



(3.11) 



The maximum-likelihood estimators (MLEs) of the parameters of this distribution were discussed 
e.g., in Bulmer (1974) and more recently in Karlis (20031 and Engen et al. (2002 1. The latter approach 



is conveniently implemented in the freely available R package poilog (|R Development Core Team 2009) 



which we have used in the current paper to perform all the necessary parameter fitting. In our setting, 



the model parameters were calculated from the multinomial conditional likelihood function (2.4) where 
the truncated probability quantity p$(k, Z)/(l — Pe{0, 0)) is given by (3.11 ). 



Under the assumed BPLN model the measures of dissimilarity may be computed either directly or 
by Monte-Carlo approximations. Denoting the means of the BPLN marginals by 



= exp[/i 4 + cr 2 /2] i = l,2, 



(3.12) 



the moment-based dissimilarity measures T>mh and T) p are given by 



S 



V 



2aia 2 exp[/0<ri<72] 



V p = l- 



'' " ai(l + ai expfof]) + a 2 (l 

2aia 2 (exp[pcri<7 2 ] - 1) 



a 2 exp[cr|]) 



where in (3.14) the quantities 



A/ai(l + ai(exp[cr^] - l)) v /a 2 (l + a 2 (exp[o-|]) - 1) 



(3.13) 
(3.14) 



a^l + a^exp^ 2 ] - 1)) i=l,2 (3.15) 

are seen to be the marginal variances of the BPLN distribution. The formulae ( 3.13||3.14 | provide 
for a convenient way of estimating the dissimilarities T>mh and T) p from the data simply by replacing 
the unknown distribution parameters by their sample estimates calculated via maximum likelihood, for 
instance, by using the numerical algorithms implemented in the "poilog" R-package. 



Unfortunately, due to the fact that the mass function (3.10 1 is not available in a closed form, there are 
no similar formulae for the indices T>mi,T^ov ■ These indices are typically approximated by the Monte- 
Carlo-based bootstrap procedures which are also used to derive confidence intervals and standard errors 
of the parameter estimates in the model. The general justification for such derivations is provided 
e.g. ' " 



Gill et al. (2009) and Rempala and Szatzschneider (2004). The reader is referred to Efron and 



Tibshirani (1997) for a general introduction to the theory and practice of statistical bootstrap which 



is the computational technique we rely upon heavily in our modeling approach and the data analysis 
described below. 



4 Application to TCR Repertoire Data 

In this section we illustrate the parametric inference on T-cell receptor data based on BPLN model 
and the associated pairwise dissimilarity measures by analyzing diversity of the repertoires of T-cell 
receptors derived from two types of genetically-engineered (TCR-mini) mice: the "wild" type with 
genetically restricted TCR repertoire and unaltered repertoire of self antigens bound to class II MHC, 
and the "Ep" (B63VJEp) mice that in addition to restricted TCR repertoire also express only single, 



covalently linked to MHC Ep peptide (Pacholczyk et al. 2006). 



4.1 Data description and processing 

The description of the transgenic "TCRmini" mice modifications was already detailed in the recent 
work Pacholczyk et al. ( |2007 ). Herein, we mention only briefly that the "TCRmini" mice is a new 



generation of TCR transgenic mouse where all T-cells express one pre-specified TCR/3 chain (specifically, 
the chain V/314D/32J/32.6), and the unique TCRa mini-locus. This mini-locus allows only for restricted 
rearrangements of a single Va2.9 segment to one of the two Ja (Ja26 and Ja2) segments. These mice 
have no other loci that encode TCRa chains and therefore their entire diversity of TCRs is derived from 
the artificially introduced TCRa mini locus, resulting in greatly altered TCRs abundance. 

For the purpose of testing our statistical model, two subpopulations of CD4+ T-cells (i.e., T-cells 
expressing a surface marker CD4) were collected representing, respectively, regulatory (TR) and naive 
(TN) T-cells (where the TR cells are defined as those expressing the additional marker Foxp3). In 
addition, these two subpopulations of CD4+ T-cells were isolated either from (1) the peripheral lymph 
nodes or (2) from the thymus, giving us a total of eight TCR populations differing by the animal type, 
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-4.46 


-3.81 


Up 


-3.31 


-2.43 


-2.40 


-2.54 


-2.59 


-2.13 


-2.38 


-1.87 


a 2 


2.03 


1.93 


1.66 


1.70 


1.99 


1.48 


1.84 


1.31 


Lo 


1.60 


1.52 


1.17 


1.21 


1.61 


1.15 


1.46 


0.90 


Up 


2.36 


2.29 


1.99 


2.02 


2.47 


1.74 


2.15 


1.70 



Table 1: Maximum likelihood estimates of the means and variances of the Ep (TCR-restricted) and Wt 
(wild-type) mice repertoires along with bias corrected 95% confidence bounds generated via parametric 
bootstrap. 



T-cell type, and tissue location. In the thymus, gross of CD4+ T-cells undergo development and in the 
lymph nodes CD4+ T-cells arc retained unless activated by specific antigen and therefore two markedly 
different patterns of clonal abundance in these organs are generally expected. Additionally, because in 
the "Ep" mice express only single class II MHC/peptide complex, the diversity of CDR3 region of TCRo 
chain is drastically reduced in comparison with TCR-mini "wild" mice. For these reasons the dataset is 
uniquely suitable for testing the BPLN model. 

The TCR data from both types of mice was collected as follows (for more details on a similar 
data harvesting procedure, see also e.g., Warren et al. (2009); Freeman et al. (2009)). Using specific 



fluorescent reagents, populations of T-cells from different organs were separated into individual wells on 
96-well plates and amplified in their unique CDR3 regions of their TCRo chain using single-cell RT- 
PCR (see, e.g., Kuczma et al.|2009 |. Following this amplification, the CDR3 regions were sequenced and 
analyzed providing the distribution of these regions in native subpopulation(s) of T- lymphocytes. This 
type of procedure has been widely considered to be one of the most reliable methods to harvest T-cell 



repertoires (Luczynski et al. 2007). Single T-cells can be separated from cell suspension or be isolated 



from tissue sections. Both the f3 and the a chains can be amplified and sequenced to provide unambiguous 
identification of T-cell clones. This method avoids the problems of skewed PCR amplification and varying 
TCR mRNA expression in different cells. Its obvious drawback is the under-sampling issue alluded to 
already in Section 1: a very large number of cells need to be analyzed to ensure detection of rare clones 
and to provide a global representation of the T-cell repertoire. We note that with the availability of 



the next generation sequencing technology (Wong et al. 2007) the large number of single cell RT-PCR 
could be replaced with the high throughput PCR from heterogeneous population of T-cells. However, 
at its current stage the technology is not yet recommended for repertoires analysis due to difficulties of 
matching specific TCRo and TCR/3 chains when amplified simultaneously. In addition, there is also a 
high risk of counts bias due to the skewed amplification process in high throughput data which tends to 
over-express the most dominant DNA sequences and under-express (or even remove) the rare ones. 



4.2 Analysis under BPLN model 

In order to assess the usefulness of a proposed parametric method of TCR data modeling, we have first 
generated the BPLN model-based estimates of the dissimilarity matrix for the eight TCR repertoires 



using separately each of the four dissimilarity measures V of the general form (3.2 ) described in Section 3 
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For the moment-based dissimilarities T>mh an d T> p we have used the explicit formulae (3.131 and (3.14) 
whereas for the remaining two measures T>ov and T>mi we have directly approximated the quantities 
p(k, I; fii, /j,2, <j\ , erf, p) given by (3.10) and subsequently used the parametric bootstrap procedure to 
produce the empirical approximations of dissimilarities. In all cases the BPLN distribution parameters 
were estimated by the maximum likelihood estimators (MLEs) computed by maximizing the conditional 
multinomial likelihood function (2.4) based on the zero-truncated probabilities (3.11). In principle, one 
could use the conditional likelihood of multivariate Poisson-lognormal distribution directly to estimate 
all of the parameters simultaneously, however, due to the complicated form of the resulting mixture 
probabilities, we have deemed that approach to be too unreliable numerically. On the other hand, the 
iterative bivariate model fitting (fitting one BPLN model at a time, conditionally on the remaining 
repertoires and iterating until convergence) was seen to be a reasonable fast and numerically stable 
procedure yielding a set of estimates consistent with marginally MLE-fitted parameters, regardless of 
the order in which conditioning was performed. 

The results of the conditional MLE procedure are partially summarized in Table[l]where the estimates 
of the p and a 1 parameters for Poisson-lognormal abundance distributions for the eight repertoires are 
reported along with the corresponding confidence intervals obtained via the parametric bootstrap bias- 
corrected percentile method (see e.g., Rempala and Szatzschneiderj|2004 for details on bootstrap-based 
interval estimation) . Note that the parameters p and a 1 are related to the marginal estimates of means 
and variances of the Poisson-lognormal variates by the formulae (3.12) and (3.15), respectively. In order 
to conserve space, the estimates of the BPLN correlation coefficients are not shown as they are similar 
in relative values to the moment-based dissimilarities summarized below in Figure [2j We note that 
the marginal values of the repertoire-specific parameters in both types of repertoires (wild-type and 
Ep) were found to be of similar magnitude (with estimated values of p parameters between -4.5 and 
-2.9 and a 2 parameters between 1.3 and 2.) Overall, the numerical values of the parameters indicated 
smaller Poisson-lognormal means for the restricted-repertoire mice as compared with the wild-type. 
Additionally, the naive T-cell repertoires generally seemed to have smaller means and larger variances 
of the mixing log-normal distributions then the regulatory T-cell repertoires. 

The goodness of fit statistics were calculated for the bivariate marginal fit via the bootstrap distri- 
bution of the conditional likelihood statistic (2.4). In all cases the differences between the bivariate data 
and fitted models were not-significant (all p- values <.05) as measured by the bootstrap tests, indicating 
reasonable good fit of the parametric distributions to the (zero-truncated) abundance data. In addition 
to the goodness-of-fit testing, we have also performed qualitative comparisons of BPLN model versus 
data via smoothed heat- map plots ( Anderes et al. 2009[ ) . One example of such comparison is provided in 



Figure [T] where the smoothed heat- map illustrates both true and model-generated bivariate abundance 
distributions of wild-type receptor repertoires: naive TCRs from lymph nodes and regulatory TCRs 
from thymus (i.e., Wt TNI and Wt TR2). 

One of the major advantages of the parametric BPLN model is that in many practical situations of 
interest it produces, across a number of commonly used dissimilarity indices of the form (3.2 ), hierarchical 
clustering models which are consistent in terms of the final clusters composition. That is, unlike in the 
non-parametric analysis, for many TCR datasets the choice of the dissimilarity measure T> is largely 
irrelevant when modeling repertoires with BPLN model. This is not surprising since our interpretation 
of the dissimilarity measure is that it accounts for differences in diversity (0,M) between pairs of 
repertoires. Under the condition (which is often satisfied, see Table [lj that a± s» a 2 and <j\ s» er 2 one 
may show that in the BPLN model the dissimilarity is a monotone function of the parameter p and 
hence measures the correlation between repertoires. This is indeed the case for all indices V discussed 
in Section 3 above. 
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The results of hierarchical clustering analysis of the eight mice TCR repertoires under T>mh and T>mi 
are presented in Figure [2j The figure's left panels (top and bottom) show the dendrograms obtained 
by agglomerative hierarchical clustering with a complete link function using T>mh (top) and T>mi (bot- 
tom) as the dissimilarity measures. Both dendrograms indicate a very good relative agreement of the 
cluster hierarchical structure and the correct final classification of the eight repertoires in terms of the 
experimental condition (restricted vs wild-type) as well as the repertoire type (naive or regulatory) and 
tissue type (thymus vs lymph nodes). Almost identical dendrograms (not shown) were also produced by 
applying the remaining two measures discussed, namely T> p and T>ov- The central panels in F igure [2 



illustrate the bootstrap approximations to the distribution of the Frobenius distance (see, e.g., Golub 
|and Van Loan| 1996 for more on matrix distances) of the dissimilarity matrix \PMH{i,j)\ (top) and 
j)] (bottom) (1 < i, j < 8), with 95% confidence bound marked with vertical lines. The right 
panels show the dendrograms corresponding to the dissimilarity matrices at the upper bound of the 
corresponding 95% confidence intervals (i.e., right vertical lines of the central panels). The fact that the 
left and right dendrograms in top and bottom panels have identical relative hierarchies indicates strong 
robustness of the hierarchical clustering against the fluctuations of both T>mh and T>mi- The similar 
robustness was also true for T> pi and T>qv- This agreement between the dissimilarity matrices entries 
generated under BPLN model using four different dissimilarity measures T>mh, Dp, T>mi and T>ov is 
further illustrated in Figure [3] where the pairwise loess regressions ( [Cleveland and Devlin 1988 ) of the 
entries of the dissimilarity measures on each other are presented indicating their monotone relationship 
(almost a linear one between the first three measures). As already indicated, such agreement should be 
expected if the data closely follows the BPLN model. 



As our final analysis for BPLN model, we have also computed the species richness estimates (3.7) 



and (3.8) for each pair of repertoires (i.e, each in the total of 28 comparisons) and averaged the result to 
obtain a pooled estimator of the ratio D/M which was found to be .09 with the 95% confidence interval 
of (.06, .11). These values suggest a much more severe under-sampling of the TCR populations then 
the traditional non-parametric estimator of Good (19531 given by fi/D « .25 (where now fx and D are 
computed from pooled repertoire data). 



4.3 Analysis under non-parametric model 

In order to further examine the results of our parametric hierarchical clustering analysis of the TCR 
repertoires, we have also performed the more traditional, non-parametric hierarchical clustering analysis 
of the repertoires in which we have estimated the values of the two dissimilarity measures T>mh and T>mi 
with the non-parametric estimates based directly on the sample data. Note that T>mh is particularly 
convenient to analyze non-parametricaly as it only requires the relative estimates of the mixed and 
marginal moments of order two, which may be calculated directly from the observed (zero-truncated) 
joined abundance. For that reason, the parametric and non-parametric measures T>mh may be directly 
compared with each other. On the other hand, the information-based measure T>mi may be estimated 



by means of a recently popularized Chao-Shen estimator ( Chao and Shen 2003 1 which, in a manner 
similar to the ACE and Horwitz-Thomson estimates (see, e.g., Chao 2006), attempts to adjust explicitly 



for the fact of only observing the truncated joined distribution. 

The result of the direct comparison between all the pairwise estimated T>mh values under the BPLN 
and the non-parametric models is presented as a scatter plot with fitted loess trend-line in Figure [4] The 
plot follows a linear trend indicative of a very close agreement between non-parametric and parametric 
Dmh values for our TCR data. This apparent almost linear relationship between the dissimilarities 
estimated under the two measures is also confirmed by the dendrogram computed under non-parametric 
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Pmh (top panel in Figure [5]) which gives a stable set of hierarchical clusters almost identical to those 
obtained under BPLN model (top panel in Figure [2]). 

In contrast to the parametric case, we found that the non-parametric analogue of the mutual in- 



formation (MI) dissimilarity (3.6), based on the coverage adjusted Chao-Shen entropy estimator (Vu 



et al. 2007), did not agree with the non-parametric Morisita-Horn (MH) dissimilarity and consequently 



yielded a very different and biologically uninterpretable TCR clustering which lacked separation between 
the wild-type and Ep mice. These differences between non-parametric MH and MI measures may be 
clearly seen in the two top panels of Figure [5] where also the lack of stability of the Mi-dissimilarity 
is clearly manifested by the large difference between dendrograms within the 95% confidence bound 
induced by the Frobenius norm of the Mi-dissimilarity matrix. Note that this is not the case for the 
MH dissimilarity, which appears quite stable. 

The lowest third panel of Figure [5] illustrates a different non-parametric dissimilarity analysis we have 
performed, based on the coverage adjusted estimated values of the Shannon entropy function for the 
eight repertoires (see e.g., Vu et al.|2007[ and formula ( A. I ) in the appendix). The pairwise dissimilarities 
were computed as absolute differences between the estimated entropy values. Such "linear" comparisons 
of the diversity measures across repertoires are often appropriate when the repertoires are assumed to 
have similar abundance patterns (see e.g., Sepulveda et al. 2009). However, as we may see from the 



plots, for our datasets the entropy-based clustering turned out to be only partially satisfactory, as the 
entropy measure was only able to clearly separate two repertoires derived from the wild-type regulatory 
cells but not the remaining ones. Overall, it appears that for our dataset the entropy based clustering 
both via Ml and the Shannon entropy performed poorly whereas the MH clustering was satisfactory and 
comparable (in fact almost identical) with our parametric results. These large differences seems to be 
at least partially due to the fact that, unlike the MH dissimilarity estimate, the non-parametric entropy 
estimates are adjusting for the zero-truncation of the observed distribution and these adjustments in 
our particular dataset seem to fluctuate widely from repertoire to repertoire. 



5 Summary and Discussion 

We have presented a simple bivariatc-Poisson-lognormal parametric model for fitting and analyzing TCR 
repertoires data which may be regarded as a natural multivariate extension of a Poisson abundance model 
with lognormal mixing distribution which has been applied in TCR modeling perviously. As seen from 
our example of analyzing TCR naive and regulatory repertoires in the wild-type and TCR-restricted Ep 
mice, the model provides for a robust and consistent fit to the TCR data allowing for a very detailed 
yet relatively simple comparison of multiple repertoires, for instance by means of dissimilarity analysis 
and hierarchical clustering. We show that in our TCR dataset under the parametric model the four 
popular dissimilarity indices: (i) Morista-Horn and (ii) correlation indices, (iii) mutual information and 
(iv) the overlap dissimilarities are monotone functions of each other, leading therefore to the same, 
biologically meaningful, clustering of the repertoires. In contrast, when applying to the same data the 
non-parametric estimates of dissimilarities, the clusterings are seen to behave erratically and are highly 
dependent on the particular choices of measures with some of them (e.g., the mutual information) yielding 
biologically implausible clustering. 

Our parametric analysis suggests that in a typical experiment based on harvesting sequences from 
single-cell TCRs, the overall under-sampling of the TCR population is much higher than in the macro- 
scopic biodiversity studies, for which many of the statistical tools of species abundance comparison were 
originally developed. This and the apparent lack of agreement between the non-parametric dissimilarity 
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measures when applied to our, relatively simple, dataset, seem to indicate that many commonly used 
non-parametric biodiversity statistics may perform poorly when applied to severely under-sampled TCR 
repertoires. The advantage of the parametric approach and in particular the model proposed here is that 
even with very severe data under-sampling it allows for the proper adjustment for the missing abundance 
information and estimation of the full set of repertoires features. The statistical package poilog available 
from CRAN archive (http://creai.r-project.org) makes the fitting of our model particularly con- 
venient by providing numerical algorithms for the parameters estimation via the maximum likelihood. 
Further studies and a larger number of TCR datasets with more sequences are needed in order to more 
comprehensively evaluate the BPLN model and test its ability to discriminate between TCR repertoires 
in a biologically meaningful way. 



A Appendix: Mutual Information Bounds 



The fact that the bounds (3.1| hold for the dissimilarity index (3.6) follows from the general properties 
of the Shannon entropy function which is defined (see e.g., Koski||200T I for any discrete random vector 
X with probability distribution p(x) as 

H{X) = -^2p(x)logp(x), (A.l) 

X 

with the summation is taken over x values for which p{x) > 0. Extending the definition of the index 
(3.6) to any pair of discrete real random variables X, Y with joined distribution p(x,y) and marginals 
p(x),p(y), we define their mutual information as 

= -^2p(x)logp(x) - ^2p(y) log p{y) + p(x,y) logp(x,y) 

x y x,y 

= H{X) + H(Y)-H(X,Y) 
Due to the elementary inequality log (a;) < x — 1 valid for any x > we have that 

-MI(X,Y) = log f P -^) < - fp{x)p{y) 

^ v p(x,y) J ^ 



,p(x,y) 



p{x,y) 



and therefore 



MI(X, Y)>0 



(A.2) 

1 ) =-X>(*My) + i = o 

(A.3) 



Note that MI{X, X) = H(X) and therefore (due to symmetry and ( A.2 )) to argue upper bound in (3.1 ) 
it suffices to show that 

H(X,Y)>H{X). (A.4) 

This follows easily, since 

'p(x,vY 



H(X,Y) - H(X) = ^2p(x,y) log p(x,y) - ^2p(x)logp(x) = -^p(x,y)log 



p(x) 



> 0. 



The bounds ^$Aj for MI(X,Y) follow now from (|AT3|)-(|AT4|) and ( |AT2] ). The measure and (Q is, of 
course a special case of MI(X, Y). 
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Figure 1: Model generated vs observed pairs of frequency data. Left panel: kernel-density- 
smoothed heatmap of joined frequency data of lymph nodes naive (x-axis) and thymus regulatory (y-axis) 
TCR repertoires in wild mice (Wt.TNl and Wt.TR2 ). Right panel: kernel-density-smoothed heatmap 
for the same-size sample simulated from the distribution of BPLN random variable fitted to the data. 
Increased brightness indicates higher frequency. 
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Figure 2: Repertoire dendrograms and their confidence bounds obtained under BPLN model. 

Dendrograms for hierarchical clustering and their corresponding confidence intervals obtained using ag- 
glomerative clustering and a complete link for eight repertoires of naive and regulatory TCRs derived 
from (1) the lymph nodes and (2) thymus in wild type and TCR mini mice. Top panel: (left) cluster- 
ing using Morisita-Horn dissimilarity measure T>mh given by (3.3); (middle) bootstrap estimate of the 
95% confidence interval of the Phrobenius norm of the T>mh dissimilarity matrix; (right) dendrogram 
corresponding to the upper bound of the 95% CI T>mh dissimilarity matrix. Bottom panel: hierarchical 
clustering according to the parametric mutual information dissimilarity measure "Dm I given by (3.6). 
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Figure 3: Pairwise comparison of dissimilarities under BPLN model. Lower triangular panels: 
pairwise dissimilarities plots obtained under BPLN models fitted to mice data for the four different V 
measures discussed in Section 3.1. Local (loess) regression curves were added to the plots for better 
readability. Upper triangular panels: R 2 values for the corresponding linear regressions. 
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Figure 4: Morisita-Horn {T>mh) dissimilarities under parametric and non-parametric models. 

Scatter plot along with the local (loess) regression curve for pairwise dissimilarities between eight repertoires 
computed under parametric (x-axis) and non-parametric (y-axis) models using Morisita-Horn index as given 
by (3.3). In the non-parametric case the joined probabilities pg(k,l) are estimated by the corresponding 
joined frequencies. 
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Figure 5: Repertoire dendrograms and their confidence bounds under non-parametric mea- 
sures of dissimilarity. Dendrograms under various dissimilarity measures obtained from the non- 
parametric analogue of the model (3.11) using agglomerative clustering and the complete link. In all 
three panel rows the most left dendrograms were obtained using the point estimates of the dissimilarity 
matrix calculated from the data, whereas the most right ones were obtained using upper 95% confidence 
bound on the Frobenius norm distribution of the dissimilarity matrix. The corresponding bootstrap es- 
timate of the entire norm distribution is provided in the middle plot as a density estimator, with 95% 
bounds marked with vertical lines. Top panels: hierarchical clusters based on the nonparametric version of 
the Morisita-Horn dissimilarity index (3.3). Middle panels: clusters based on the non-parametric version 
of the mutual information dissimilarity (3.6). Bottom panels: clusters based on the values of the Shannon 
entropy function with no direct pairwise comparisons. 
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